Week 02: Lexical Analysis & Finite Automata
Lexical Analysis
Token Class (or Class)
Classify program substrings according to role (token class), class corresponding to sets of strings:
- Identifier : string of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: else, if, begin, ...
- Whitespace: a non-empty sequence of blanks, newlines, tabs
Goal of Lexical Analysis
- Definition
- lexeme: A lexeme is a sequence of characters that matches the pattern for a token.
- token: A token is a pair consisting of the token name and the value.
- Concept
- Parttition the input string into lexemes.
- Identity the token of each lexeme.
- Communicate tokens to the parser.
- Input: Program Substrings
- Output: Tokens
- Sample Input:
string (foo=42)
- Sample Output:
<class, "string">, <'('>, <ID, "foo">, <Operator, "=">, <"Int", "42>, <')'>
- Remark
- Left-to-Right scan => lookahead required.
Regular Languages
Regular Expression
- Concept: Regular expressions (syntax) specify regular languages (set of strings).
- Two base cases
- Single Character: $'c' = \{ "c" \}$
- Empty String: $\epsilon = \{ "" \}$
- 3 compound expressions
- Union: $A + B = \{a\ |\ a \in A\} \cup \{ b\ |\ b \in B \}$
- Concatenation: $AB = \{ab\ |\ a \in A \wedge b \in B \}$
- Iteration: $
A^* = \bigcup_{i \geq 0} A^i,
\begin{cases}
A^i = A...A \\
A^0 = \epsilon
\end{cases}
$
Def. The regular expression over $\Sigma$ are the smallest set of expressions including:
$$
\begin{align}
R &= \epsilon \\
&|\ 'c', c\in\Sigma\\
&|\ R+R\\
&|\ RR\\
&|\ R^*
\end{align}
$$
Def. Let $\Sigma$ be a set of characters (an alphabet).
A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.
Meaning function $L$ maps regular expressions to set of strings.
$$
\begin{align}
L(\epsilon) &= \{ "" \} \\
L('c') &= \{ "c" \} \\
L(A+B) &= L(A) \cup L(B) \\
L(AB) &= \{ab\ |\ a \in L(A) \wedge b \in L(B)\} \\
L(A^*) &= \bigcup_{i \geq 0} L(A)^i \\
\end{align}
$$
Q:Why use a meaning function?
- Make clear what is syntax, what is semantics.
- Allow us to consider notation as a separate issue.
- Because expression and meanings are not 1-1.
Lexical Specifications
Lexemes
- Keyword: "if" / "else" / "then" / ...
- $ 'if' + 'else' + 'then' + ...$
- Integer: a non-empty string of digits
- $digits: '0' + '1' + '2' + ...$
- $(digit)(digit)^* = digit^+$
- Identifier: strings of letters or digits, starting with a letter
- $letter = 'a' + 'b' + 'c' + ... = [a-zA-Z]$
- $(letter)(digit+letter)^*$
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
- $('\;' + '\setminus n' + '\setminus t')^+$
More regular expressions
- At least one: $A^+ = AA^*$
- Union: $A|B = A+B$
- Option: $A? = A+\epsilon$
- Range: $'a'+'b'+\dots+'z' = [a-z]$
- Excluded Range: $\widehat{[a-z]} = [\wedge a-z]$
How do we check $program \in L(R)$?
- Write a regular expression for the lexemes of each token class
- Number $= digit^+$
- Keyword $= 'if'+'else'+\dots$
- Identifiew $= letter(letter+digit)^*$
- OpenPar $='('$
- ...
- Construct $R$ to match all lexemes.
- $R = Number + Keyword + Number + ... = R_1 + R_2 + ...$
- Let input be $x_1...x_n$. Find the longest length $i$ such that $x_1...x_i \in L(R)$.
- Remove $x_1...x_i$. Go to step 3.
- If $x_1...x_i \in L(R_a) \cap L(R_b)$, apply $R_{\min(a,b)}$.